General Information about the dataset

Citation Request: This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

  1. Title: Wine Quality

  2. Sources Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

  3. Past Usage:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).

  4. Relevant Information:

The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.

  5. Number of Instances: red wine - 1599; white wine - 4898.

  6. Number of Attributes: 11 + output attribute

Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.

  7. Attribute information:

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)

Output variable (based on sensory data):
12 - quality (score between 0 and 10)

  8. Missing Attribute Values: None

  9. Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)


Univariate Plots

## [1] 4898   21
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "qual"                 "quality3"            
## [16] "quality4"             "quality5"             "quality6"            
## [19] "quality7"             "quality8"             "quality9"
## 'data.frame':    4898 obs. of  21 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ qual                : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ quality3            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ quality4            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ quality5            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ quality6            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ quality7            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ quality8            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ quality9            : num  0 0 0 0 0 0 0 0 0 0 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##                                                                        
##     alcohol      quality       qual          quality3       
##  Min.   : 8.00   3:  20   Min.   :3.000   Min.   :0.000000  
##  1st Qu.: 9.50   4: 163   1st Qu.:5.000   1st Qu.:0.000000  
##  Median :10.40   5:1457   Median :6.000   Median :0.000000  
##  Mean   :10.51   6:2198   Mean   :5.878   Mean   :0.004083  
##  3rd Qu.:11.40   7: 880   3rd Qu.:6.000   3rd Qu.:0.000000  
##  Max.   :14.20   8: 175   Max.   :9.000   Max.   :1.000000  
##                  9:   5                                     
##     quality4          quality5         quality6         quality7     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.03328   Mean   :0.2975   Mean   :0.4488   Mean   :0.1797  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                      
##     quality8          quality9       
##  Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.000000  
##  Mean   :0.03573   Mean   :0.001021  
##  3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.000000  
## 

The worst quality in the dataset is 3 and the best is 9. To get a better understanding of the quality ranking, we plot a histogram.

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
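As a quick check of how unbalanced the classes are, the counts above can be turned into shares. This is only a minimal sketch that retypes the printed counts by hand; the variable names (counts, share, middle_share) are my own and not part of the original analysis.

```r
# Quality counts copied from the table() output above (white wine data).
counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
            `7` = 880, `8` = 175, `9` = 5)

total <- sum(counts)                      # number of wines in the dataset
share <- round(100 * counts / total, 1)   # percentage per quality level

# Fraction of all wines that fall into the three middle levels 5, 6 and 7.
middle_share <- sum(counts[c("5", "6", "7")]) / total

print(share)
print(round(middle_share, 3))
```

The three middle levels hold more than nine out of ten wines, which is why the extreme levels are so hard to model later on.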

## 
##                8              8.4              8.5              8.6 
##                2                3                9               23 
##              8.7              8.8              8.9                9 
##               78              107               95              185 
##              9.1              9.2              9.3              9.4 
##              144              199              134              229 
##              9.5 9.53333333333333             9.55              9.6 
##              228                3                2              128 
## 9.63333333333333              9.7 9.73333333333333             9.75 
##                1              105                2                1 
##              9.8              9.9               10 10.0333333333333 
##              136              109              162                1 
##             10.1 10.1333333333333            10.15             10.2 
##              114                2                3              130 
##             10.3             10.4 10.4666666666667             10.5 
##               85              153                2              160 
## 10.5333333333333            10.55 10.5666666666667             10.6 
##                1                2                1              114 
##            10.65             10.7             10.8             10.9 
##                1               96              135               88 
## 10.9333333333333 10.9666666666667            10.98               11 
##                2                3                1              158 
##            11.05 11.0666666666667             11.1             11.2 
##                2                1               83              112 
## 11.2666666666667             11.3 11.3333333333333            11.35 
##                1              101                3                1 
## 11.3666666666667             11.4 11.4333333333333            11.45 
##                1              121                1                4 
## 11.4666666666667             11.5            11.55             11.6 
##                1               88                1               46 
## 11.6333333333333            11.65             11.7 11.7333333333333 
##                2                1               58                1 
##            11.75             11.8            11.85             11.9 
##                2               60                1               53 
##            11.94            11.95               12            12.05 
##                2                1              102                1 
## 12.0666666666667             12.1            12.15             12.2 
##                1               51                2               86 
##            12.25             12.3 12.3333333333333             12.4 
##                1               62                1               68 
##             12.5             12.6             12.7            12.75 
##               83               63               56                3 
##             12.8 12.8933333333333             12.9               13 
##               54                2               39               36 
##            13.05             13.1 13.1333333333333             13.2 
##                1               18                1               14 
##             13.3             13.4             13.5            13.55 
##                7               20               12                1 
##             13.6             13.7             13.8             13.9 
##                9                7                2                3 
##               14            14.05             14.2 
##                5                1                1

## 
## 0.22 0.23 0.25 0.26 0.27 0.28 0.29  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 
##    1    1    4    4   13   13   16   31   35   54   59   84   85  120  129 
## 0.38 0.39  0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 
##  214  151  168  139  181  161  216  178  225  172  179  166  249  140  156 
## 0.53 0.54 0.55 0.56 0.57 0.58 0.59  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 
##  135  167  102  108   83   99   97   88   45   68   48   67   28   36   35 
## 0.68 0.69  0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79  0.8 0.81 0.82 
##   44   30   27   18   33   12   19   22   19   16   19   16    5    5   13 
## 0.83 0.84 0.85 0.86 0.87 0.88 0.89  0.9 0.92 0.94 0.95 0.96 0.97 0.98 0.99 
##    2    4    3    2    2    7    1    5    2    2    5    3    1    6    1 
##    1 1.01 1.06 1.08 
##    1    1    1    1

## 
## 2.72 2.74 2.77 2.79  2.8 2.82 2.83 2.84 2.85 2.86 2.87 2.88 2.89  2.9 2.91 
##    1    1    1    3    3    1    4    1    9    9    9   11   17   31   15 
## 2.92 2.93 2.94 2.95 2.96 2.97 2.98 2.99    3 3.01 3.02 3.03 3.04 3.05 3.06 
##   18   38   35   26   63   32   41   68   74   49   68   78   97   89  115 
## 3.07 3.08 3.09  3.1 3.11 3.12 3.13 3.14 3.15 3.16 3.17 3.18 3.19  3.2 3.21 
##   79  136   92  135  126  134  117  172  136  164  124  138  145  137   95 
## 3.22 3.23 3.24 3.25 3.26 3.27 3.28 3.29  3.3 3.31 3.32 3.33 3.34 3.35 3.36 
##  146  116  132  114   96   88   87   82   93   79   86   49   79   48   83 
## 3.37 3.38 3.39  3.4 3.41 3.42 3.43 3.44 3.45 3.46 3.47 3.48 3.49  3.5 3.51 
##   49   58   40   39   30   48   20   33   17   28   21   21   23   15   14 
## 3.52 3.53 3.54 3.55 3.56 3.57 3.58 3.59  3.6 3.61 3.62 3.63 3.64 3.65 3.66 
##   17   13   14    9    8    5    5    6    7    3    1    6    2    4    5 
## 3.67 3.68 3.69  3.7 3.72 3.74 3.75 3.76 3.77 3.79  3.8 3.81 3.82 
##    1    2    2    1    3    2    2    2    2    1    2    1    1





## 
##  0.6  0.7  0.8  0.9 0.95    1 
##    2    7   25   39    4   93



## Source: local data frame [7 x 6]
## 
##   quality mean_alcohol  mean_pH mean_density mean_chlorides    n
## 1       3     10.34500 3.187500    0.9948840     0.05430000   20
## 2       4     10.15245 3.182883    0.9942767     0.05009816  163
## 3       5      9.80884 3.168833    0.9952626     0.05154633 1457
## 4       6     10.57537 3.188599    0.9939613     0.04521747 2198
## 5       7     11.36794 3.213898    0.9924524     0.03819091  880
## 6       8     11.63600 3.218686    0.9922359     0.03831429  175
## 7       9     12.18000 3.308000    0.9914600     0.02740000    5

Univariate Analysis

What is the structure of your dataset?

There are 4898 white wine samples with 12 variables (11 physicochemical inputs and the quality score):
  • fixed acidity
  • volatile acidity
  • citric acid
  • residual sugar
  • chlorides
  • free sulfur dioxide
  • total sulfur dioxide
  • density
  • pH
  • sulphates
  • alcohol
  • quality

The variable quality is an ordered factor with the following possible levels; only the levels 3 to 9 occur in this dataset.

(worst) … (best)
quality: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Other observations:

The median quality is 6. The alcohol distribution has a spike at 9.5, and the residual sugar distribution has a spike at 2.

Looking back at the quality counts, 1457 white wines received a quality of 5, 2198 a 6 and 880 a 7. It is interesting to see that the distribution of quality is skewed.

The distributions of alcohol, sulphates, chlorides, residual sugar and volatile acidity are also skewed. This is my impression from looking at the distribution charts.

What is/are the main feature(s) of interest in your dataset?

From what I know as a wine “expert”, pH, alcohol and sugar have a big impact on wine quality. So I guess these parameters should have the most influence in a predictive model for the quality of white wine; that is my personal opinion.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Comparing the distributions suggests that chlorides and volatile acidity could also have an impact on quality.

Did you create any new variables from existing variables in the dataset?

First I changed the type of the variable quality from int to factor. In the dataset, quality is the only categorical variable.


Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The histogram for the variable citric.acid is strange because there is a spike at 0.5.

I used dplyr to group the values by quality and calculate the mean values for some chosen parameters.
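The grouping step can be sketched as follows. The original analysis used dplyr; this sketch swaps in base R's aggregate() so that it runs without extra packages, and the small wines data frame is invented purely for illustration (it is not the real data).

```r
# Toy data frame standing in for the wine data (values are made up).
wines <- data.frame(
  quality = c(5, 5, 6, 6, 6, 7),
  alcohol = c(9.4, 9.8, 10.1, 10.5, 11.0, 11.4),
  pH      = c(3.10, 3.20, 3.15, 3.25, 3.20, 3.30)
)

# Mean of each chosen parameter per quality level.
means <- aggregate(cbind(alcohol, pH) ~ quality, data = wines, FUN = mean)

# Number of wines per quality level (same ordering as `means`).
means$n <- as.vector(table(wines$quality))

print(means)

# Roughly equivalent dplyr pipeline (not run here):
#   wines %>% group_by(quality) %>%
#     summarise(mean_alcohol = mean(alcohol), mean_pH = mean(pH), n = n())
```

Keeping the group size n next to the means is useful because a mean over 5 wines (quality 9 in the real table) is far less reliable than one over 2198 wines.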

I plotted the data between the 1% and 99% quantiles to cut off extreme values and make the large spikes in the plots easier to see.
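A minimal sketch of such quantile limits, assuming a plain numeric vector. The values in x are a hand-picked toy sample and the names (x, limits, kept) are hypothetical; the original plotting code is not shown in this report.

```r
# Toy sample with one very small and one very large value.
x <- c(0.6, 1.2, 1.6, 5.2, 6.9, 8.5, 20.7, 65.8)

# Lower and upper limits at the 1% and 99% quantiles.
limits <- quantile(x, probs = c(0.01, 0.99))

# Keep only the values inside the limits.
kept <- x[x >= limits[1] & x <= limits[2]]

print(limits)
print(kept)

# Assumption: passing limits like these to a plot axis drops the points
# outside them, which would be consistent with the "Removed ... rows"
# warnings that appear in the output of this report.
```

Note that the rows are only hidden from the plot; the summary statistics and correlations are still computed on all 4898 wines.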


Bivariate Plots Section

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.25581431      0.002857966
## fixed.acidity        -0.255814305    1.00000000     -0.022697290
## volatile.acidity      0.002857966   -0.02269729      1.000000000
## citric.acid          -0.149899918    0.28918070     -0.149471811
## residual.sugar        0.006623775    0.08902070      0.064286060
## chlorides            -0.045645192    0.02308564      0.070511571
## free.sulfur.dioxide  -0.011928911   -0.04939586     -0.097011939
## total.sulfur.dioxide -0.161979037    0.09106976      0.089260504
## density              -0.185976097    0.26533101      0.027113845
## pH                   -0.115774132   -0.42585829     -0.031915368
## sulphates             0.009807759   -0.01714299     -0.035728147
## alcohol               0.213656245   -0.12088112      0.067717943
## qual                  0.035763247   -0.11366283     -0.194722969
##                       citric.acid residual.sugar   chlorides
## X                    -0.149899918    0.006623775 -0.04564519
## fixed.acidity         0.289180698    0.089020701  0.02308564
## volatile.acidity     -0.149471811    0.064286060  0.07051157
## citric.acid           1.000000000    0.094211624  0.11436445
## residual.sugar        0.094211624    1.000000000  0.08868454
## chlorides             0.114364448    0.088684536  1.00000000
## free.sulfur.dioxide   0.094077221    0.299098354  0.10139235
## total.sulfur.dioxide  0.121130798    0.401439311  0.19891030
## density               0.149502571    0.838966455  0.25721132
## pH                   -0.163748211   -0.194133454 -0.09043946
## sulphates             0.062330940   -0.026664366  0.01676288
## alcohol              -0.075728730   -0.450631222 -0.36018871
## qual                 -0.009209091   -0.097576829 -0.20993441
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                          -0.0119289106         -0.161979037 -0.18597610
## fixed.acidity              -0.0493958591          0.091069756  0.26533101
## volatile.acidity           -0.0970119393          0.089260504  0.02711385
## citric.acid                 0.0940772210          0.121130798  0.14950257
## residual.sugar              0.2990983537          0.401439311  0.83896645
## chlorides                   0.1013923521          0.198910300  0.25721132
## free.sulfur.dioxide         1.0000000000          0.615500965  0.29421041
## total.sulfur.dioxide        0.6155009650          1.000000000  0.52988132
## density                     0.2942104109          0.529881324  1.00000000
## pH                         -0.0006177961          0.002320972 -0.09359149
## sulphates                   0.0592172458          0.134562367  0.07449315
## alcohol                    -0.2501039415         -0.448892102 -0.78013762
## qual                        0.0081580671         -0.174737218 -0.30712331
##                                 pH    sulphates     alcohol         qual
## X                    -0.1157741316  0.009807759  0.21365624  0.035763247
## fixed.acidity        -0.4258582910 -0.017142985 -0.12088112 -0.113662831
## volatile.acidity     -0.0319153683 -0.035728147  0.06771794 -0.194722969
## citric.acid          -0.1637482114  0.062330940 -0.07572873 -0.009209091
## residual.sugar       -0.1941334540 -0.026664366 -0.45063122 -0.097576829
## chlorides            -0.0904394560  0.016762884 -0.36018871 -0.209934411
## free.sulfur.dioxide  -0.0006177961  0.059217246 -0.25010394  0.008158067
## total.sulfur.dioxide  0.0023209718  0.134562367 -0.44889210 -0.174737218
## density              -0.0935914935  0.074493149 -0.78013762 -0.307123313
## pH                    1.0000000000  0.155951497  0.12143210  0.099427246
## sulphates             0.1559514973  1.000000000 -0.01743277  0.053677877
## alcohol               0.1214320987 -0.017432772  1.00000000  0.435574715
## qual                  0.0994272457  0.053677877  0.43557472  1.000000000
## [1] "fixed.acidity"        "volatile.acidity"     "residual.sugar"      
## [4] "chlorides"            "total.sulfur.dioxide" "density"             
## [7] "pH"                   "alcohol"


Check some correlations for possible model parameters

## [1] 0.06428606
## Warning in loop_apply(n, do.ply): Removed 214 rows containing missing
## values (geom_point).

## [1] -0.3601887
## Warning in loop_apply(n, do.ply): Removed 203 rows containing missing
## values (geom_point).

## [1] -0.7801376


Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Alcohol and chlorides have a fairly high negative correlation (-0.36) compared to the other correlation coefficients, so I will reject chlorides as a parameter for a linear model (alcohol and quality have a higher correlation than chlorides and quality).

There is a very strong negative correlation of -0.78 between alcohol and density. For building a linear model I would not use these two variables together because of this high correlation.
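To see why a strongly correlated second predictor adds little, the R-squared of a two-predictor linear model can be computed directly from the three pairwise correlations. This is a sketch under the assumption of an ordinary least-squares model with an intercept; the correlations are rounded values read from the matrix above, and the variable names are my own.

```r
# Correlations with quality (qual), rounded from the matrix above.
r_xy <- c(alcohol = 0.4356, density = -0.3071)

# Correlation matrix of the two predictors (alcohol vs. density: -0.7801).
R_xx <- matrix(c(1, -0.7801,
                 -0.7801, 1), nrow = 2, byrow = TRUE)

# R-squared of quality ~ alcohol + density: r' R^-1 r
r2_both <- drop(t(r_xy) %*% solve(R_xx) %*% r_xy)

# R-squared of quality ~ alcohol alone is the squared correlation.
r2_alcohol <- unname(r_xy["alcohol"]^2)

print(round(c(alcohol_only = r2_alcohol, alcohol_and_density = r2_both), 3))
```

Both values come out at roughly 0.19, so density contributes almost nothing once alcohol is already in the model, which supports leaving one of the two out.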

Total.sulfur.dioxide also has a fairly high negative correlation with alcohol (-0.45), so I would reject it for building a linear model as well.


Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The dataset contains a high number of wines with quality 5 and quality 6.


What was the strongest relationship you found?

The strongest relationship for building a model to predict the quality of white wine is with alcohol, which has a correlation of 0.44 with quality. Density follows with -0.31.

Multivariate Plots Section


## Calls:
## lin: lm(formula = as.numeric(quality) ~ alcohol, data = wqw[, 2:13])
## lin2: lm(formula = as.numeric(quality) ~ alcohol + pH, data = wqw[, 
##     2:13])
## lin3: lm(formula = as.numeric(quality) ~ alcohol + pH + volatile.acidity, 
##     data = wqw[, 2:13])
## lin4: lm(formula = as.numeric(quality) ~ alcohol + pH + volatile.acidity + 
##     residual.sugar, data = wqw[, 2:13])
## 
## =========================================================
##                      lin      lin2      lin3      lin4   
## ---------------------------------------------------------
## (Intercept)        0.582*** -0.258     0.337    -0.771** 
##                   (0.098)   (0.250)   (0.245)   (0.259)  
## alcohol            0.313***  0.309***  0.321***  0.373***
##                   (0.009)   (0.009)   (0.009)   (0.010)  
## pH                           0.277***  0.224**   0.355***
##                             (0.076)   (0.074)   (0.074)  
## volatile.acidity                      -1.966*** -2.094***
##                                       (0.110)   (0.109)  
## residual.sugar                                   0.028***
##                                                 (0.002)  
## ---------------------------------------------------------
## R-squared             0.190     0.192     0.242     0.262
## adj. R-squared        0.190     0.192     0.241     0.261
## sigma                 0.797     0.796     0.771     0.761
## F                  1146.395   581.296   519.857   434.378
## p                     0.000     0.000     0.000     0.000
## Log-likelihood    -5839.391 -5832.739 -5677.165 -5610.423
## Deviance           3112.257  3103.815  2912.775  2834.466
## AIC               11684.782 11673.479 11364.330 11232.847
## BIC               11704.272 11699.465 11396.813 11271.826
## N                  4898      4898      4898      4898    
## =========================================================
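One detail worth checking when reading the coefficients: the models use as.numeric(quality), and in R as.numeric() on a factor returns the internal level codes, not the labels. With the seven levels "3" to "9" that means the response runs from 1 to 7 instead of 3 to 9. The sketch below uses a toy factor (not the real column); the second part retypes rounded numbers from the summaries and the lin column above, which appear consistent with level codes.

```r
# Toy factor with the same seven levels as `quality` in the str() output.
quality <- factor(c(3, 6, 6, 9), levels = 3:9)

codes  <- as.numeric(quality)                # level codes: 1 4 4 7
scores <- as.numeric(as.character(quality))  # original scores: 3 6 6 9

print(codes)
print(scores)

# A least-squares line passes through the point of means, so
# intercept + slope * mean(alcohol) should equal the mean response.
fitted_at_mean <- 0.582 + 0.313 * 10.51  # `lin` coefficients, mean alcohol
mean_code      <- 5.878 - 2              # mean of `qual`, shifted to codes 1..7

print(c(fitted_at_mean = fitted_at_mean, mean_code = mean_code))
```

If this reading is right, only the intercepts are shifted by 2; the slopes, R-squared and the other fit statistics do not change when a constant is subtracted from the response. Using the integer column qual, or as.numeric(as.character(quality)), would give predictions on the original 3 to 9 scale.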

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The most important parameters for predicting quality are alcohol and volatile.acidity, as the linear models show.

Were there any interesting or surprising interactions between features?

Yes, it was very surprising that all parameters that are highly correlated with quality are also highly correlated with alcohol, for example:
quality <-> total.sulfur.dioxide <-> alcohol
quality <-> density <-> alcohol
quality <-> chlorides <-> alcohol

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I built a linear model with the parameters alcohol, volatile.acidity, residual.sugar and pH.

The best linear model has an R-squared of 0.262, which is very low: it explains only about 26% of the variance in quality.

It looks like the data represents qualities 5 and 6 well, but for the other quality levels we have too little data to create a better model.

Final Plots and Summary

Plot One


Description One

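The left plot shows that better wines tend to have more alcohol and not so good wines less: the mean alcohol content rises from 9.81 at quality 5 to 12.18 at quality 9. The right plot shows that the pH value tends to be slightly higher for better wines, but this relationship is weak (correlation 0.10).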

Plot Two

## Warning in loop_apply(n, do.ply): Removed 157 rows containing missing
## values (stat_smooth).
## Warning in loop_apply(n, do.ply): Removed 200 rows containing missing
## values (geom_point).
## Warning in loop_apply(n, do.ply): Removed 2 rows containing missing values
## (geom_path).

## [1] -0.7801376

Description Two

The plot shows the strongest negative correlation in the dataset: -0.78 between alcohol and density.

Plot Three

## [1] 0.4355747

Description Three

The parameter alcohol describes the quality of wine best. The dataset has 7 different values for quality and 103 different values for alcohol, which makes it a little hard to create a better plot.

Reflection

It was a nice experience to work with this dataset. In the beginning I was happy to deal with no factors; on a second look I realized that the variable quality is a factor that is stored as an integer. First I plotted all histograms to get an idea of the dataset. The quality scale runs from 0 to 10, but in this dataset only three levels (5, 6 and 7) have more than 800 samples each.

The second part analyzes the correlations. Here I was very surprised to see that alcohol is highly correlated with the same parameters as quality. That makes it very hard to choose the parameters for a linear model. In the beginning I thought pH, sugar and alcohol were the main parameters for quality, but the data tell a different story. By choosing the parameters alcohol, pH, volatile.acidity and residual.sugar I created a linear model with an R-squared of 0.26. That is a very poor result.

After checking all other parameters for the linear model and double-checking the analysis, I found two possible explanations: (I) the dataset has too few samples in most quality categories to give a representative result, or (II) the quality score, which is based on sensory evaluation, does not correspond well to the chemical parameters. It would be very interesting to get a dataset with more data per quality level to find out which assumption is correct.